I. INTRODUCTION



As the complexity of man’s machines increases, so does 
the need for simple, efficient man-machine interfaces. 
Automatic speech recognition plays a major role in this man- 
machine communication because of the superiority of speech 
over other modes of human communication. Speech is the most 
familiar and most convenient way for humans to communicate. 
Voice input leaves the hands and eyes of the operator free 
to perform other tasks and allows speaker mobility. 

Word recognition is one facet of the research conducted 
in the area of speech processing. Speech processing can be 
divided into three major categories. The speech analysis 
area includes word recognition, speaker identification, and 
speaker verification. The second category is speech 
synthesis. An example of synthesis is a data-retrieval 
system, where the computer responds verbally when its data 
base is interrogated. Another example is when a child 
receives a verbal response from his toy informing him he has 
correctly answered a question. The third area is a 
combination of the first two, speech analysis followed by 
speech synthesis. This has application in secure voice 
transmission and speech data rate reduction. As an example 
of the latter, the telephone company requires 64K bits/sec 
to transmit speech. The Department of Defense standard for 
data rate reduction is 2.4K bits/sec. The Air Force is 
experimenting with data rates as low as 150 bits/sec, which 
still provide intelligible speech. 

The advent of the general purpose digital computer 
in the mid-1960s provided speech researchers with a 
powerful tool. Numerous speech processing algorithms 
using digital signal processing techniques have been developed 
for both analysis and synthesis. These range from dynamic 
programming methods that time-warp speech prior to processing 
to algorithms for extracting the parameters used in speech 
synthesis. Speech processing is now a billion dollar a year 
business. 

Various speaker-dependent word recognition systems are 
commercially available. These systems generally perform 
some type of spectral analysis on the incoming speech 
signal. The recognition process involves classical pattern 
recognition techniques. These systems have a very high rate 
of successful recognition. 

The success of these systems notwithstanding, the 
problem of constructing a speaker-independent recognition 
system remains unsolved. The solution to this problem 
involves determining what features of speech contain the 
information and hence are speaker independent. Before one 
can talk about extracting the information content from the 
speech signal, a look at a model of how humans produce 
speech is in order. 

A. FUNDAMENTALS OF SPEECH 

Flanagan [Refs. 1 and 2] formulated a generally accepted 
model for human speech production. His model describes the 
vocal tract as a nonuniform acoustic tube connecting the 
vocal cords and the lips. In an adult male the vocal tract 
is approximately 17 cm. in length. 

The vocal tract can be connected to an ancillary cavity 
called the nasal cavity. The coupling is accomplished 
through a trapdoor mechanism called the velum. The nasal 
cavity begins at the velum and terminates at the nostrils. 
In an adult it is about 12 cm. long. When non-nasal sounds 
are produced the velum closes, thereby sealing off the nasal 
cavity. 

Humans are capable of producing two types of sounds, 
voiced and unvoiced. In the case of voiced sounds air moves 
over the vocal cords causing them to vibrate in a quasi- 
periodic fashion. Unvoiced sounds are generated by either 
forming a constriction in the tract and forcing the air 
through at high velocity or by allowing pressure to build up 
behind the closure and then releasing it suddenly. The name 
fricative is associated with the former while plosive is the 
name given to the latter. 
Since the physical configuration of the vocal tract 
changes with time, Flanagan's model can be represented as a 
linear time-varying system as shown in Figure 1.1. 

[Block diagram: excitation x(t) applied to a Time Varying Filter v(t), producing the output y(t).] 

Figure 1.1. Model of Speech Production 

If it is assumed that the vocal tract changes slowly 
with time, the output can be approximated by the short-term 
convolution of the excitation, x(t), and the vocal tract 
impulse response, v(t). For voiced sounds x(t) is 
quasiperiodic, hence the output y(t) is also quasiperiodic. 
For the unvoiced case the excitation x(t) is random and is 
generally approximated by white noise. 
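
The short-term convolution model can be illustrated with a small discrete-time sketch. It is only a toy: the sampled form y(n) = x(n) * v(n), the pitch period of 80 samples, and the single damped resonance used for the vocal tract response are all assumptions made for illustration and are not taken from Flanagan's model.

```python
import numpy as np

def excitation(n_samples, voiced, pitch_period=80, seed=0):
    """x(n): a periodic impulse train (voiced) or white noise (unvoiced)."""
    if voiced:
        x = np.zeros(n_samples)
        x[::pitch_period] = 1.0
        return x
    return np.random.default_rng(seed).standard_normal(n_samples)

def vocal_tract_response(length=64, decay=0.9, resonance=0.1):
    """Toy v(n): one damped resonance (hypothetical values)."""
    n = np.arange(length)
    return decay ** n * np.cos(2 * np.pi * resonance * n)

def synthesize(voiced, n_samples=800):
    """Short-term model: y(n) = x(n) * v(n), with v(n) held fixed."""
    x = excitation(n_samples, voiced)
    v = vocal_tract_response()
    return np.convolve(x, v)[:n_samples]

voiced_output = synthesize(True)     # repeats every pitch period
unvoiced_output = synthesize(False)  # noise-like, no repetition
```

With a periodic excitation the output repeats from one pitch period to the next, while the noise excitation gives an output with no such repetition.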

If the vocal tract impulse response of an individual 
could be obtained, then intelligible speech could be 
generated using the time-varying linear system model. 
The excitation would be either periodic or 
random depending on whether voiced or unvoiced sounds are 
desired. Figure 1.2 is a simplified speech synthesis 
machine where the vocal tract parameters are stored in the 
RAM and downloaded to the voice synthesis chip, which is 
excited by either the periodic or the random signal. This 
type of speech synthesis arrangement is the basis for Texas 
Instruments' (TI) Speak and Spell toys. TI can custom 
manufacture a speech synthesis chip which will emulate 
anyone's voice for $15,000. 

[Block diagram; legible label: Controller.] 

Figure 1.2. Voice Synthesis 
These voiced and unvoiced sounds are combined in a 
unique fashion to form phonemes, the basic building blocks 
of language. All languages can be reduced to a finite 
number of these distinguishable building blocks. Phonemes 
are of such fundamental importance that if one phoneme is 
exchanged for another the meaning of an utterance is 
completely altered. 

Thus, in theory, if a machine could be designed to 
disassemble utterances into their phoneme components the 
speech recognition problem would be completely solved. 
Despite vast amounts of time, effort, and money expended, 
however, the phoneme disassembler appears to be years away 
from becoming a reality. 

B. SPEECH RECOGNITION MACHINES 

While the phoneme disassembler does not exist, several 
types of speech recognition systems are commercially 
available. The majority of these systems are classified as 
isolated word recognizers. As the name implies the systems 
are designed to recognize isolated words. The vocabulary of 
these machines is usually limited to 100-300 words and these 
systems are extremely speaker dependent. Thus, a person 
desiring to use these machines must first train the machine 
to recognize his voice. During the training phase the 
speaker’s utterances are processed and templates formed. 

The recognition process involves comparing the incoming 
utterance with those templates stored in the machine's 
memory [Ref. 3]. Although these machines have a limited 
vocabulary and cannot recognize connected or conversational 
speech, they are extremely useful for inventory control, 
quality assurance control, or for a pilot to check the 
systems in a combat aircraft. In all these instances the 
vocabulary is limited, the speaker is known, and voice data 
entry frees the individual to perform other tasks. 

ITT has developed a word recognition system for the Air 
Force's F-16 fighter. The system is capable of recognizing 
300 words and allows the pilot to check the status of 
certain systems while he maintains two-hand control of the 
plane. This two-hand control is particularly important 
during low level, high speed attack runs. The pilots 
update their voice patterns monthly, or whenever their voice 
changes due, say, to a cold. The patterns are stored in a 
bubble memory and inserted into the system prior to 
take-off. The microphone is located inside the pilot's 
oxygen mask and the system status is displayed on the 
cockpit's CRT. At a recent demonstration of this system it 
had a correct recognition rate of 99%. 

The NPS Speech Processing Laboratory acquired an isolated 
word recognition system for experimentation purposes. 
The system is the VRM Voterm-2 manufactured by Interstate 
Electronics Corporation. The system, acquired in 1981, 
weighs 10 lbs. and cost $2500. Today the same system has 
been reduced to a four-chip set, at a cost of $1000. 

The operation of the VRM is typical of the word 
recognition systems currently available [Ref. 4]. It allows 
the user to select the vocabulary size, decision threshold 
and number of training passes. It also allows for reference 
pattern transfer between itself and the host computer. The 
host computer serves only as a mass storage device and 
controller. All processing and recognition is performed 
real-time by the VRM. 

The input speech signal is analyzed by a 16-filter 
analog spectrum analyzer and then passed through an A/D 
converter. This digitized speech data is then converted to 
a fixed-size (120 bit) pattern that preserves the informa- 
tion content of the utterance. During the training phase 
the VRM rejects utterances that do not sufficiently agree 
with previous training samples of the word. This rejection 
leads to a reduction of the number of ’ones’ stored in the 
pattern. After seven training passes the pattern contains 
approximately one hundred ’zeroes’. 
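
The comparison of an incoming pattern against stored templates can be sketched as follows. This is not the VRM's algorithm, which is not described here: the Hamming distance measure, the function names, and the example templates are assumptions chosen only to show how a fixed-size 120 bit pattern and a decision threshold might interact.

```python
import numpy as np

PATTERN_BITS = 120  # pattern size quoted in the text

def hamming_distance(a, b):
    """Number of bit positions in which two patterns differ (assumed metric)."""
    return int(np.count_nonzero(a != b))

def recognize(pattern, templates, threshold):
    """Return the closest template's word, or None if it is too far away."""
    best_word, best_dist = None, PATTERN_BITS + 1
    for word, template in templates.items():
        d = hamming_distance(pattern, template)
        if d < best_dist:
            best_word, best_dist = word, d
    return best_word if best_dist <= threshold else None

# Hypothetical reference patterns for two vocabulary words.
templates = {
    "zero": np.zeros(PATTERN_BITS, dtype=np.uint8),
    "one": np.concatenate([np.ones(60, dtype=np.uint8),
                           np.zeros(60, dtype=np.uint8)]),
}

noisy_one = templates["one"].copy()
noisy_one[:5] ^= 1          # flip five bits to imitate utterance variation
ambiguous = np.zeros(PATTERN_BITS, dtype=np.uint8)
ambiguous[:30] = 1          # equally far from both templates

match = recognize(noisy_one, templates, threshold=20)
reject = recognize(ambiguous, templates, threshold=20)
```

Raising the threshold accepts more distorted utterances at the price of more false matches, which is the trade-off a user-selectable decision threshold exposes.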

In 1980, NATO and the Rome Air Development Center (RADC) 
[Ref. 5] conducted a comparison test on three isolated word 
recognition systems. The vocabulary used consisted of the 
ten single digits of the respective languages of the 
speakers. The machines evaluated were the VRM system, the 
Threshold Technology 8040 Preprocessor (cost $50,000) and 
the Nippon Electric DP-100 (cost $60,000). 

Table 1.1 lists the results from the RADC test [Ref. 6]. 
Each speaker trained the machines by repeating each digit 
ten times. No attempt was made to introduce speakers who 
had not trained the machine. However, in tests run at the 
Speech Processing Lab with the VRM, using the ten digits, 
three sets of reference patterns, and some non-trained 
speakers, the successful recognition rate for new speakers 
was less than 30%. 

Thus, these systems work extremely well for what they 
were designed to accomplish. As previously stated, the 
basic question of what parameters of speech are speaker 
independent still remains unanswered. Numerous theories 
have been proposed and all have been unsuccessful. There is 
a lack of understanding of the human mechanisms used in 
understanding speech. 
Table 1.1. RECOGNITION PERCENTAGES FOR RADC/NATO TEST 

[Table body illegible in the scanned source.] 



II. MODELS OF THE EAR 



For a long time people have been trying to understand 
how the human ear functions. In the first century B.C., the 
Roman poet and philosopher Lucretius postulated a model 
"involving little grains of sand in the inner ear responding 
to different tones" [Ref. 7]. The 18th century Italian 
violinist Tartini noted that the ear produced a third tone 
from two tones played simultaneously. Thus the long held 
belief that the ear was a linear device was demonstrated to 
be false. Today the ear is thought to be a nonlinear device 
even at power levels near the threshold of hearing. 

The first concentrated research into the process of 
hearing did not begin until the mid-1800's. This was the 
time of Seebeck, Helmholtz, and Ohm. It was Ohm who 
postulated a now famous law on the relationship of speech 
and its phase angle. He stated that all the information 
content of speech is contained in its power spectrum and is 
independent of the phase angle of the components. Although 
Ohm's law has been modified in recent years, it remains 
one of the fundamental laws of psychoacoustics. 

The ear can be broken down into three physical areas: 
the outer, middle and inner ear. Sound waves impinge on the 
outer ear and are conducted down a canal until they reach 
the middle ear. The middle ear contains three tiny bones. 
The alternate compressions and rarefactions of the speech 
wave cause the eardrum to strike the bones. In the inner 
ear the wave travels along a thin membrane whose frequency 
response varies between 100 Hz and 20 kHz. This provides 
for spectral analysis of the incoming signal. 

The membrane of the inner ear is lined with tiny hairs. 
It is these hairs, or more correctly groups of hairs, that 
perform the spectral analysis. Recent studies at the 
California Institute of Technology [Ref. 8] have found that 
each tiny hair bundle consists of 30-150 thin, rod-shaped 
extensions called cilia. These hair bundles are attached to 
hair cells. The hair cells are very sensitive transducers 
which convert the movement of the hair bundle into an 
electrical signal which is sent to the brain. The hair bundle-hair 
cell combination forms a sort of mechanical spectrum 
analyzer. 

Manfred Schroeder [Ref. 9] describes an experiment in 
which the inner ear’s sensitivity to phase was demonstrated. 
The experiment was as follows: 

1) A 100 sec. sample of speech was Fourier transformed. 

2) Random phase angles were assigned to the frequency 
components (assuming a uniform distribution from 0 to 2π). 

3) The inverse Fourier transform was taken. 

The resultant signal sounded like white noise. Thus by 
randomizing the phase angles the signal was transformed from 
intelligible speech to noise. This lent credence to the 
hypothesis that the inner ear was phase sensitive and that 
Ohm's law, if not wrong, was at least in need of modification. 
The experiment was repeated, this time using a 50 
msec sample of speech. The resultant signal was 
non-intelligible noise. Ohm's law, modified to say that only the 
short term amplitude spectrum contained the speech 
information, appeared to be correct. 
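
The three steps of the phase randomization procedure can be sketched in a few lines. This is a minimal illustration and not the code used in the experiment: the function name is hypothetical, and a random test sequence stands in for the speech sample.

```python
import numpy as np

def randomize_phase(signal, seed=0):
    """Keep the magnitude spectrum, replace the phase with uniform random values."""
    rng = np.random.default_rng(seed)
    spectrum = np.fft.rfft(signal)                      # step 1
    magnitude = np.abs(spectrum)
    phase = rng.uniform(0.0, 2.0 * np.pi, size=magnitude.shape)  # step 2
    # The DC bin (and the Nyquist bin for even lengths) must stay real
    # so that the inverse transform is a real signal with the same magnitudes.
    phase[0] = 0.0
    if len(signal) % 2 == 0:
        phase[-1] = 0.0
    return np.fft.irfft(magnitude * np.exp(1j * phase), n=len(signal))  # step 3

# A random sequence stands in for the speech sample (assumption).
x = np.random.default_rng(1).standard_normal(1024)
y = randomize_phase(x)
```

The output has the same amplitude spectrum as the input but a different waveform, which is the condition under which Ohm's law predicts no audible change.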

Ohm based his law on a model of the ear that said: 

1) The ear has a bank of tuned bandpass filters covering the 
audio range. 

2) Only the output amplitude of each filter is sent to 
the brain. 

Today the most likely candidates for the bandpass filters are 
the hair bundle-hair cell combinations that respond to only 
selected stimuli. 

In 1947 an experiment was conducted [Ref. 10] in an 
effort to obtain a definite answer to the phase sensitivity 
question. A 2000 Hz carrier was amplitude modulated by a 100 Hz 
signal. Thus three frequency components (1900 Hz, 2000 Hz, 
2100 Hz) were present. One of the sidebands had its phase 
shifted by 180°. This phase shift resulted in what was 
termed a quasi-FM (QFM) signal. Upon listening to the 
signals there was a noticeable difference between the AM and 
QFM signals. Thus there was a revived interest in the ear's 
capability to discern waveforms and not just their 
amplitude. 
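
The difference between the AM and QFM signals can be made concrete with a short sketch. The 20 kHz sampling rate and the modulation index of 0.5 are assumptions; only the three component frequencies and the 180° shift of one sideband come from the description above.

```python
import numpy as np

fs = 20000                       # sampling rate in Hz (assumed)
t = np.arange(fs) / fs           # one second of samples
carrier, mod, m = 2000.0, 100.0, 0.5   # m is an assumed modulation index

def three_tone(lower_sideband_sign):
    """Carrier plus two sidebands; a sign of -1 is a 180 degree phase shift."""
    return (np.cos(2 * np.pi * carrier * t)
            + 0.5 * m * np.cos(2 * np.pi * (carrier + mod) * t)
            + lower_sideband_sign * 0.5 * m * np.cos(2 * np.pi * (carrier - mod) * t))

am = three_tone(+1.0)    # ordinary amplitude modulation
qfm = three_tone(-1.0)   # quasi-FM: lower sideband inverted

am_mag = np.abs(np.fft.rfft(am))
qfm_mag = np.abs(np.fft.rfft(qfm))
```

The two signals have identical amplitude spectra, yet the AM waveform peaks at 1 + m while the QFM waveform stays below sqrt(1 + m^2), so any audible difference must come from the waveform rather than from the amplitude spectrum.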

In a further effort to determine to what extent phase is 
important in discerning speech, Hall and Schroeder [Ref. 11] 
conducted an experiment in which the phase angle of one of two 
pure tones was changed. Specifically, signals consisting of one 
tone at 200 Hz and 0° and another at 400 Hz with a phase angle of 0°, 
60°, 120°, 180°, 240°, or 300° were listened to, three 
signals at a time. The listeners' task was to determine 
which two signals sounded most alike and which two sounded 
least alike. The results showed that the signals whose 
400 Hz components differed least in phase angle were 
consistently judged to be the most similar. 

About twelve years prior to this experiment researchers 
at Bell Labs postulated that the phase dependency seen in 
experiments involving the inner ear could be traced to the 
phase dependence of the inner and middle ear distortion 
products. Due to the presence of these nonlinear distortion 
products a new spectrum, called the inner spectrum was 
formed in the inner ear. It is this spectrum that is 
analyzed by the hair bundles of the inner ear. 

This theory certainly would explain what happened at 
Bell Labs during a 1958 experiment [Ref. 12]. When the 
phase of one of 31 equal-amplitude harmonics, all at 0° phase, 
was changed to 180°, a pure tone was heard. This tone was 
not heard when the signal was put through a loudspeaker. 
Thus, using the inner spectrum theory, changing the phase of 
one harmonic to 180° altered the amplitude of one of the 
distortion products. This altered the inner spectrum, 
causing a bump in the spectrum where previously it had been 
flat. 

In Germany, Terhardt and Fastl [Ref. 13] conducted 
experiments trying to connect frequency difference and phase 
angles. They formed a signal 

$s(t) = a_1 \cos(2\pi f_1 t) + a_2 \cos(2\pi f_2 t - \phi_2)$ 

where $f_1$ = 200 Hz and $f_2$ = 400 Hz, and asked listeners 
to adjust the amplitude of each component so the 
400 Hz tone was just audible. This was to be done while the 
phase angle, $\phi_2$, of the 400 Hz tone was changed. The 
results showed that when $\phi_2$ was changed from 0° to 180°, the 
amplitude of the 400 Hz signal had to be increased by 12 dB 
to remain audible. 

Yet another theory on the functioning of the ear came out 
of this experiment. The researchers theorized that the hair 
cells of the ear were discerning the time between successive 
spikes in the waveform and passed this information to the 
brain. This appeared to be a reasonable explanation, since when $\phi_2$ 
= 0° the time between successive spikes was 2.5 msec. With 
$\phi_2$ = 180° the time between spikes was 5 msec, unless the 
amplitude of the 400 Hz tone was increased by a considerable 
amount. With the amplitude increased, the small spikes at the 
2.5 msec mark would increase dramatically. 

This theory is consistent with the physiology of the 
ear. All the electric pulses transmitted to the brain from 
the hair cells have approximately the same amplitude, thus 
the timing between the pulses is the information that they 
carry. 

From the myriad of theories presented it is easy to 
conclude that a definitive model of the human ear is 
nonexistent. The fact that phase contains some information 
content has been demonstrated. Whether phase alone is the 
speaker independent feature that researchers are looking for 
remains an unanswered question. Experiments conducted in 
the late 1970's and 1980's using phase-only representations 
of speech have given some credibility to the hypothesis 
that phase must be included as one of the speaker 
independent features of speech. 
III. PHASE-ONLY REPRESENTATIONS OF SPEECH 



Recapitulating, Ohm's law stated that all the 
information content of speech could be obtained from the 
short term power spectrum and that phase angle of the 
components was meaningless. Thus, in the short term the ear 
is phase deaf. Oppenheim [Ref. 14] sought to explore more 
fully the importance of phase in speech. 

Given the Fourier transform of a speech signal 

$F(\omega) = |F(\omega)|\, e^{j\theta(\omega)}$    (3.1) 

if $|F(\omega)|$ is set equal to one, the inverse transform 
of $e^{j\theta(\omega)}$ is a phase only representation of the speech. 

This phase only representation retained total intelligibility, 
while exhibiting the characteristics of being 
high-pass filtered and having white noise added. The magnitude 
only representation was speech-like in its appearance but 
was not intelligible. 

Oppenheim concluded that transforming a signal to its 
phase only form was equivalent to passing it through a 
spectral whitening process with a filter whose response is 
$H(\omega) = 1 / |F(\omega)|$, where $F(\omega)$ is the Fourier transform of the 
original signal. This spectral whitening did not destroy 
the intelligibility of the speech. 
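
The equivalence between the phase only form and spectral whitening can be checked numerically. The sketch below is illustrative only: the function names are hypothetical, a random sequence stands in for speech, and a small floor `eps` is an added guard against division by zero that is not part of the original formulation.

```python
import numpy as np

def phase_only(signal, eps=1e-12):
    """Whiten the spectrum with H = 1/|F|, leaving only the phase."""
    F = np.fft.fft(signal)
    H = 1.0 / np.maximum(np.abs(F), eps)   # whitening filter response
    return np.fft.ifft(F * H).real

def magnitude_only(signal):
    """Discard the phase (set it to zero), keeping only |F|."""
    return np.fft.ifft(np.abs(np.fft.fft(signal))).real

x = np.random.default_rng(0).standard_normal(512)   # stand-in for speech
p = phase_only(x)
m = magnitude_only(x)
```

The spectrum of `p` has unit magnitude at every frequency, which is why the phase only signal behaves like a whitened version of the original.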
Cox and Robinson [Ref. 15] 
conducted a series of four experiments which preserve the 
short term phase of a speech signal while either destroying 
or severely distorting the amplitude. Contrary to Ohm's law, 
these phase-only signals were found to retain many speech 
characteristics and were intelligible to the listeners. Hence under certain 
transformations short term phase may be one of the physical 
invariants of speech. 

The experiments used a speech signal that was analog 
band limited to 8 kHz and sampled at a rate of 20 kHz with 
12 bits A/D. Successive 25.6 msec windows, corresponding to 
512 data points, were fast Fourier transformed. Nonlinear 
operations were applied to each data set, and the inverse 
fast Fourier transforms were taken, yielding 25.6 msec of 
reconstructed speech signal. These signals were D/A 
converted at a rate of 20 kHz and passed through an 8 kHz low 
pass analog filter. Only rectangular windows were used and 
no attempt was made to fit the windows together since the 
amplitude of the reconstructed signal was unimportant. The 
first two experiments are included for completeness only. 
The latter two are the concern of this thesis. 

A. SHORT-TERM PHASE ONLY SIGNALS 

This experiment basically repeated the previously 
mentioned work of Oppenheim, as the magnitude of the Fourier 
transform of the data sets was set equal to one. The phase 
was unchanged. The reconstructed short-term phase only 
signal was found to retain many of the original waveform's 
features. Listeners could identify speaker dependent 
characteristics and the intelligibility, while not judged 
good, was likened to a signal containing a lot of noise. 
There was no attempt made by the researchers to clean up the 
signal. The results of this experiment clearly are contrary 
to Ohm's law and demonstrate that short-term phase only 
speech is intelligible. 
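
The frame-by-frame procedure for the short-term phase only signal can be sketched as follows. It is a minimal illustration, not the experimenters' code: the function name and the `eps` floor are assumptions, and a random sequence stands in for the sampled speech.

```python
import numpy as np

FRAME = 512  # 25.6 msec at a 20 kHz sampling rate

def short_term_phase_only(signal, frame=FRAME, eps=1e-12):
    """Set the FFT magnitude of each rectangular frame to one, keep the phase."""
    n_frames = len(signal) // frame
    out = np.zeros(n_frames * frame)
    for k in range(n_frames):
        segment = signal[k * frame:(k + 1) * frame]   # rectangular window
        F = np.fft.fft(segment)
        unit = F / np.maximum(np.abs(F), eps)         # magnitude set to one
        out[k * frame:(k + 1) * frame] = np.fft.ifft(unit).real
    return out   # frames are simply butted together, as in the text

x = np.random.default_rng(0).standard_normal(4 * FRAME + 100)  # stand-in signal
y = short_term_phase_only(x)
```

Samples beyond the last full frame are dropped in this sketch; the text does not say how a partial final frame was handled.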

B. ANALYTIC SIGNAL PROCESSING 

The second experiment was a repeat of one carried out in 
the late 1940's. Here the representation is an infinitely 
clipped version of the original signal 

$s_c(t) = \mathrm{Sgn}[s(t)]$    (3.2) 

where s(t) is the original signal, and Sgn is defined to be 
the sign of s(t). Thus the continuous valued signal, s(t), 
was transformed into a discrete valued signal. The 
transformation retains only the real-zero information of 
s(t). That is, if s(t) was an analytic signal the 
real zeros mark the times when the phase changed by 180°. The 
intelligibility of such a signal was not commented on by the 
experimenters; however, they did say that large amounts of 
speech information were retained using this transform. 
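
Infinite clipping as in equation (3.2) is easy to demonstrate on a sampled signal. In this sketch the two-tone test signal and the choice to map exact zeros to +1 are assumptions made so that the clipped signal takes only two values.

```python
import numpy as np

def infinite_clip(s):
    """Two-level version of s: +1 where s >= 0, -1 elsewhere."""
    return np.where(s >= 0.0, 1.0, -1.0)

def zero_crossings(s):
    """Indices at which the sign of s changes between adjacent samples."""
    positive = s >= 0.0
    return np.flatnonzero(positive[1:] != positive[:-1]) + 1

t = np.arange(2000) / 2000.0
s = np.sin(2 * np.pi * 5 * t) + 0.5 * np.sin(2 * np.pi * 13 * t + 0.3)
clipped = infinite_clip(s)
```

All amplitude detail is discarded, but the zero crossings of the clipped signal coincide with those of the original, which is the only information the transformation retains.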
C. DIRECT PHASE CEPSTRUM 



The concept of cepstral analysis of speech was developed 
by Oppenheim [Ref. 16] and is an example of a broad class of 
nonlinear processing called homomorphic processing. These 
homomorphic systems obey generalized laws of superposition. 
If $x_1(n)$ and $x_2(n)$ are inputs to a homomorphic system, 
$y_1(n)$ and $y_2(n)$ are the corresponding outputs, and k is any scalar, 
then 

$y_1(n) = \phi[x_1(n)]$ 
$y_2(n) = \phi[x_2(n)]$ 

$\phi[x_1(n) \,\triangle\, x_2(n)] = \phi[x_1(n)] \,\square\, \phi[x_2(n)]$ 

$\phi[k \circ x_1(n)] = k \ast y_1(n)$ 

where $\triangle$, $\square$, $\circ$, and $\ast$ are mathematical operations. 

The importance of these homomorphic systems is that $\phi$ 
can be broken down into a cascade of operations as shown in 
Figure 3.1, where $A_\circ$ and $A_\circ^{-1}$ are inverses of each other and L 
is a simple linear filter. 

Thus Oppenheim [Ref. 17] formulated a model for the 
production of speech as shown in Figure 3.2. The model is 
based on the assumption that the excitation and vocal tract 
parameters are independent. The source of excitation for 
the voiced sounds is the impulse generator whose period is 
controlled by the pitch-period signal. The impulse 
generator produces an impulse once every $N_q$ samples, where 
$N_q$ is the pitch period and $1/N_q$ is the pitch frequency. The 
unvoiced excitation is from the random number generator and 
simulates both fricative and plosive sounds. The digital 
filter is assumed to be slowly varying with time and hence 
changes its coefficients once every 10 msec. The amplitude 
control simply adjusts the output level of the speech. 

Figure 3.1. Homomorphic System 

[Block diagram; legible labels: Pitch Period, Digital Filter Coefficient, Speech Samples.] 

Figure 3.2. Model for Speech Production 

Using this model the output digitized speech waveform 
consists of the convolution of 

(1) The train of impulses representing the pitch 

(2) The excitation pulse 

(3) The vocal tract impulse response. 

If x(n) denotes the output signal, then 

$x(n) = [p(n) \ast e(n) \ast u(n)]\, w(n)$    (3.3) 

where p(n) is the train of pitch pulses, e(n) is the 
excitation pulse, u(n) the vocal tract impulse response, and 
w(n) the window through which the speech is viewed. The 
window w(n) is smooth, hence we can define 

$\hat{p}(n) = p(n)\, w(n)$    (3.4) 

Then substituting this into equation (3.3) it is possible to 
approximate x(n) by 

$x(n) = \hat{p}(n) \ast e(n) \ast u(n)$    (3.5) 

Examining equation (3.5) it is possible to convert the 
triple convolution into a triple sum by first taking the 
Fourier transform and then taking the logarithm. Processing 
of this signal can be accomplished by a linear system, and 
recovery of the waveform can be made by passing the 
processed signal through an exponentiator followed by an inverse 
Fourier transformer. Thus a homomorphic system for 
processing speech has been developed, as shown in Figure 3.3 
[Ref. 18]. 

Figure 3.3. Homomorphic System for Processing Speech 



Variations on this basic system have been developed to 
estimate parameters of both the vocal tract transmission 
functions and the excitation functions. One of these 
variations involves making the assumption that the 
excitation is $s(n) = \hat{p}(n) \ast e(n)$; then equation (3.5) can be 
written as 

$x(n) = u(n) \ast s(n)$    (3.6) 

The system to process signals given by equation (3.6) is 
shown in Figure 3.4 [Ref. 19]. 

Referring to Figure 3.4, the signal at A is x(n) and the 
signal at D is called the cepstrum of x(n) and equals the 
cepstrum of the excitation plus the cepstrum of the vocal 
tract impulse response. 

[Block diagram; legible labels: Data Window, Cepstrum Window.] 

Figure 3.4. Cepstral Processing of Speech 
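
The additive property of the cepstrum can be verified numerically. The sketch below assumes the log magnitude form of the cepstrum and uses random sequences in place of the vocal tract response u(n) and the excitation s(n); the function name and the transform length are illustrative choices.

```python
import numpy as np

def real_cepstrum(x, nfft):
    """IDFT of the log magnitude of the DFT of x (zero padded to nfft)."""
    spectrum = np.fft.fft(x, nfft)
    return np.fft.ifft(np.log(np.abs(spectrum))).real

rng = np.random.default_rng(0)
u = rng.standard_normal(32)      # stand-in for the vocal tract response
s = rng.standard_normal(64)      # stand-in for the excitation
x = np.convolve(u, s)            # x(n) = u(n) * s(n), length 95

nfft = 128                       # long enough to hold the full convolution
c_x = real_cepstrum(x, nfft)
c_sum = real_cepstrum(u, nfft) + real_cepstrum(s, nfft)
```

Because the DFT turns the convolution into a product and the logarithm turns the product into a sum, `c_x` and `c_sum` agree to within rounding error.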



An important feature of the cepstrum at D is that it 
separates the excitation from the vocal tract response. The 
excitation is a sequence of quasi-periodic pulses, thus its 
Fourier transform, at point B, is a line spectrum where the 
lines are spaced at harmonics of the fundamental frequency. 
The log magnitude operation does not affect the general 
shape of the spectrum. The IDFT of the signal produces 
another quasi-periodic waveform with pulses spaced at the 
fundamental period. Thus the cepstrum of the excitation 
should consist of pulses around n = 0, T, 2T, ..., where T 
is the pitch period. 

The DFT of the vocal tract response is a slowly varying 
function of frequency. The log magnitude and IDFT yield a 
sequence that is negligible after a few samples. The cepstrum 
at D consists of two sequences, one which is negligible 
after a few samples and one that is periodic. Thus the 
cepstrum at D does differentiate the excitation from the 
vocal tract parameters. The use of cepstral processing 
has been extended into many diverse fields [Ref. 20]. 
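
The separation described above can be seen on a synthetic voiced frame. Everything in this sketch is an assumption made for illustration: the pitch period of 80 samples, the single damped resonance standing in for the vocal tract, the Hamming data window, and the search range for the pitch peak.

```python
import numpy as np

PITCH_PERIOD = 80    # samples (hypothetical)
FRAME = 1024

# Excitation: impulse train at the pitch period.
p = np.zeros(FRAME)
p[::PITCH_PERIOD] = 1.0

# Toy vocal tract impulse response: one damped resonance.
k = np.arange(100)
u = 0.95 ** k * np.cos(2 * np.pi * 0.1 * k)

# Windowed speech-like frame, in the spirit of equation (3.3).
x = np.convolve(p, u)[:FRAME] * np.hamming(FRAME)

# Cepstrum: IDFT of the log magnitude spectrum.
log_magnitude = np.log(np.abs(np.fft.fft(x)) + 1e-12)
cepstrum = np.fft.ifft(log_magnitude).real

# Look for the excitation peak away from the low-time vocal tract part.
low, high = 40, 200
peak = low + int(np.argmax(cepstrum[low:high]))
```

The largest cepstral value in the search range falls at, or within a sample of, the pitch period, while the vocal tract contribution is concentrated in the first few samples.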

For their third experiment, Cox and Robinson [Ref. 21] 
modified Figure 3.4 by setting the magnitude of the signal 
at point C equal to one. Hence the cepstrum at point D is 
due only to the phase of the signal at A. What amount of 
information and intelligibility does this phase only 
cepstrum contain? Surprisingly, the cepstrum was judged to 
be very intelligible by listeners and the noise level was 
reduced when compared with the short-term phase only speech 
(experiment number one). 

D. INSTANTANEOUS PHASE OF THE ANALYTIC SIGNAL 

The fourth experiment performed by Cox and Robinson 
[Ref. 22] was first performed in 1955 by Marcou and Daguet, 
who were looking for more efficient modulation techniques. 
They sought to use the analytic signal representation of a 
real signal s(t). Given a real signal s(t), which is 
Hilbert transformable, form a quadrature signal $s^*(t)$ and 
construct 

$$m(t) = s(t) + j\,s^*(t) \tag{3.7}$$

From equation (3.7) it is possible to recover the original 
signal as 

$$s(t) = \mathrm{Re}[m(t)] = |m(t)| \cos\theta(t) \tag{3.8}$$

Equation (3.8) lets the real signal, s(t), be represented by a 
magnitude and phase. 

The concept of an analytic signal, as the construction in 
equation (3.7) is called, was meaningless for discrete-time 
signals until Rabiner and Schafer [Ref. 23] developed a 
complex representation for real discrete-time bandpass 
signals. 

Following the notation of Rabiner and Schafer, given a 
real sequence, x(n), with Fourier transform X(ω), construct 
a complex sequence 

$$\tilde{x}(n) = x(n) + j\,\hat{x}(n) \tag{3.9}$$

the Fourier transform of which is 

$$\tilde{X}(\omega) = \begin{cases} 2X(\omega), & 0 < \omega < \pi \\ 0, & \pi < \omega < 2\pi \end{cases} \tag{3.10}$$

From equation (3.9) the Fourier transform of $\tilde{x}(n)$ is 

$$\tilde{X}(\omega) = X(\omega) + j\,\hat{X}(\omega) \tag{3.11}$$

and from equation (3.10) it follows that 

$$X(\omega) + j\,\hat{X}(\omega) = 0, \qquad \pi < \omega < 2\pi$$

and 

$$X(\omega) + j\,\hat{X}(\omega) = 2X(\omega), \qquad 0 < \omega < \pi.$$

These requirements are satisfied if 

$$\hat{X}(\omega) = H_d(\omega)\,X(\omega) \tag{3.12}$$

where 

$$H_d(\omega) = \begin{cases} -j, & 0 < \omega < \pi \\ +j, & \pi < \omega < 2\pi \end{cases} \tag{3.13}$$
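
The construction in equations (3.9) through (3.13) can be checked numerically on a short periodic sequence. The sketch below is only an illustration and not a routine from this thesis: it assumes a direct O(N^2) DFT over one period (so the filtering is circular), takes the Hilbert transformer response as zero at ω = 0 and ω = π where (3.13) leaves it undefined, and uses hypothetical helper names (`dft`, `idft`, `hilbert_via_dft`). 

```python
import cmath
import math


def dft(x):
    """Direct O(N^2) discrete Fourier transform."""
    n_pts = len(x)
    return [sum(x[n] * cmath.exp(-2j * math.pi * k * n / n_pts)
                for n in range(n_pts)) for k in range(n_pts)]


def idft(X):
    """Inverse of dft()."""
    n_pts = len(X)
    return [sum(X[k] * cmath.exp(2j * math.pi * k * n / n_pts)
                for k in range(n_pts)) / n_pts for n in range(n_pts)]


def hilbert_via_dft(x):
    """Apply H_d of (3.13) bin by bin: -j on (0, pi), +j on (pi, 2*pi)."""
    n_pts = len(x)
    X = dft(x)
    X_hat = []
    for k, value in enumerate(X):
        if k == 0 or 2 * k == n_pts:   # omega = 0 or pi: response taken as 0
            X_hat.append(0j)
        elif 2 * k < n_pts:            # 0 < omega < pi
            X_hat.append(-1j * value)
        else:                          # pi < omega < 2*pi
            X_hat.append(1j * value)
    return [v.real for v in idft(X_hat)]


N = 32
x = [math.cos(2 * math.pi * 3 * n / N) for n in range(N)]
x_hat = hilbert_via_dft(x)

# Complex sequence of (3.9) and its spectrum, to compare with (3.10).
analytic = [complex(a, b) for a, b in zip(x, x_hat)]
spectrum = dft(analytic)
```

For this cosine input the imaginary part comes out as the matching sine, and the spectrum of the complex sequence is doubled at the positive-frequency bin and vanishes at its mirror bin, as (3.10) requires. 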

Thus given any sequence x(n), it is possible to obtain the 
sequence $\hat{x}(n)$ by linear filtering of x(n) with a filter 
whose frequency response is given by equation (3.13). Such 
a filter is called an ideal Hilbert transformer and $\hat{x}(n)$ is 
the Hilbert transform of x(n). The impulse response of the 
ideal Hilbert transformer is 

$$h_d(n) = \begin{cases} \dfrac{2}{\pi}\,\dfrac{\sin^2(\pi n/2)}{n}, & n \neq 0 \\ 0, & n = 0 \end{cases} \tag{3.14}$$

Examining equation (3.14), the impulse response is 
non-causal, of infinite duration, has odd symmetry, and all 
even-numbered samples are equal to zero (i.e., $h_d(2n) = 0$, 
n = 0, ±1, ±2, ±3, ...). 
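
The properties just listed can be read off a few samples of (3.14). The following is a minimal sketch; the function name `ideal_hilbert` is a hypothetical choice for illustration. 

```python
import math


def ideal_hilbert(n):
    """Impulse response (3.14) of the ideal Hilbert transformer."""
    if n == 0:
        return 0.0
    return (2.0 / math.pi) * math.sin(math.pi * n / 2.0) ** 2 / n


# Samples for n = -5, ..., 5: odd n give 2/(pi*n), even n give zero
# (up to floating-point rounding).
samples = {n: ideal_hilbert(n) for n in range(-5, 6)}
for n in sorted(samples):
    print(n, round(samples[n], 7))
```

The non-zero samples fall off only as 1/n, which is why a long FIR approximation is needed in what follows. 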

Since infinite length, non-causal impulse responses are 
not realizable, an FIR approximation is required. Given a 
causal FIR system whose impulse response is h(n), 0 ≤ n ≤ N-1, 
its frequency response is given by 

$$H(\omega) = \sum_{n=0}^{N-1} h(n)\,e^{-j\omega n} \tag{3.15}$$

Equation (3.13) says the desired frequency response, $H_d(\omega)$, is 
purely imaginary. Thus, apart from a linear-phase delay 
factor, the real part of equation (3.15) must equal zero, as 
h(n) is real. In order for this to hold, h(n) must satisfy 
the symmetry condition 

$$h(n) = -h(N-1-n), \qquad n = 0, \ldots, N-1. \tag{3.16}$$



If N is odd, h(n) has odd symmetry about n = (N-1)/2. If N is 
even, h(n) has odd symmetry about a point halfway between the 
samples at n = (N/2) - 1 and n = N/2. If equation (3.16) is 
satisfied, equation (3.15) can be written as 

$$H(\omega) = e^{-j\omega(N-1)/2}\; j\,H^*(\omega) \tag{3.17}$$

where $H^*(\omega)$ is a real function of ω. If N is odd, $H^*(\omega)$ can 
be written as 

$$H^*(\omega) = \sum_{n=1}^{(N-1)/2} a(n)\,\sin(\omega n) \tag{3.18}$$

where 

$$a(n) = 2h\!\left(\frac{N-1}{2} - n\right), \qquad n = 1, 2, \ldots, \frac{N-1}{2}. \tag{3.19}$$

Also for N odd, 

$$h\!\left(\frac{N-1}{2}\right) = 0. \tag{3.20}$$

For N even, equation (3.18) becomes 

$$H^*(\omega) = \sum_{n=1}^{N/2} b(n)\,\sin\!\left[\omega\left(n - \tfrac{1}{2}\right)\right] \tag{3.21}$$

where 

$$b(n) = 2h\!\left(\frac{N}{2} - n\right), \qquad n = 1, \ldots, \frac{N}{2}.$$

Examining equation (3.17) more closely, we find that the 
factor $e^{-j\omega(N-1)/2}$ represents a delay of (N-1)/2 samples. 
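
The factorization of the frequency response into a delay term, a factor of j, and the real sine series of (3.18) and (3.19) can be confirmed numerically. The sketch below assumes a made-up 7-tap filter that satisfies (3.16) and (3.20); the tap values and function names are hypothetical and chosen only for illustration. 

```python
import cmath
import math


def direct_response(h, omega):
    """Frequency response evaluated from the definition (3.15)."""
    return sum(h[n] * cmath.exp(-1j * omega * n) for n in range(len(h)))


def amplitude_response(h, omega):
    """Real sine series of (3.18) with a(n) from (3.19); len(h) must be odd."""
    mid = (len(h) - 1) // 2
    return sum(2.0 * h[mid - n] * math.sin(omega * n)
               for n in range(1, mid + 1))


# Hypothetical taps with h(n) = -h(N-1-n) and h((N-1)/2) = 0.
h = [-0.05, 0.0, -0.60, 0.0, 0.60, 0.0, 0.05]
omega = 0.9
mid = (len(h) - 1) // 2

lhs = direct_response(h, omega)
# Delay factor times j times the real amplitude function.
rhs = cmath.exp(-1j * omega * mid) * 1j * amplitude_response(h, omega)
print(abs(lhs - rhs))
```

The two sides agree to rounding error for any ω, which is all that the delay interpretation relies on. 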

In finding an approximation to the ideal Hilbert 
transformer, coefficients a(n) and b(n) were chosen in such a 
fashion that $jH^*(\omega)$ approximates the ideal frequency response 
given by equation (3.13). Thus $H^*(\omega)$ must approximate 

$$D(\omega) = \begin{cases} -1, & 2\pi F_L \le \omega \le 2\pi F_H \\ +1, & 2\pi(1 - F_H) \le \omega \le 2\pi(1 - F_L) \end{cases} \tag{3.22}$$

where $F_L$ and $F_H$ are the lower and upper cutoff frequencies 
represented as fractions of 2π. From equations (3.18) and 
(3.21), $H^*(\omega)$ must equal zero at ω = 0 and ω = π when N is 
odd and must equal zero at ω = 0 for the case when N is even. 

For the ideal transformer the impulse response was zero 
for all even numbered samples and the frequency response was 
imaginary, odd, periodic and 

$$H_d(\omega) = H_d(\pi - \omega).$$

For the FIR approximation similar properties must be 
valid. If N is odd and $F_L = .5 - F_H$, assume that 

$$H^*(\omega) = H^*(\pi - \omega). \tag{3.23}$$

Then substituting into equation (3.18) yields 

$$\sum_{n=1}^{(N-1)/2} a(n)\,\sin(n\omega) = \sum_{n=1}^{(N-1)/2} a(n)\,\sin[(\pi - \omega)n] = \sum_{n=1}^{(N-1)/2} a(n)\,(-1)^{n+1} \sin(\omega n)$$

Rearranging terms, 

$$\sum_{n=1}^{(N-1)/2} a(n)\,\sin(\omega n)\,\bigl[1 - (-1)^{n+1}\bigr] = 0$$

Thus 

$$a(n) = 0 \quad (n \text{ even}), \qquad a(n) \text{ unconstrained} \quad (n \text{ odd}).$$

Combining this result with equations (3.16), (3.19), and 
(3.20), we have that when (N-1)/2 is even, h(n) = 0 for 
n = 0, 2, 4, ..., and when (N-1)/2 is odd, h(n) = 0 for 
n = 1, 3, 5, .... For the case of N even no relationship 
among the coefficients exists. 
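
The sign factor used in this argument comes from the angle-difference identity. Written out, as a routine expansion supplied here for the reader rather than taken from the references: 

```latex
\begin{aligned}
\sin[(\pi - \omega)n] &= \sin(\pi n)\cos(\omega n) - \cos(\pi n)\sin(\omega n) \\
                      &= 0 - (-1)^{n}\sin(\omega n) \\
                      &= (-1)^{n+1}\sin(\omega n)
\end{aligned}
```

For even n the bracket $1 - (-1)^{n+1}$ equals 2, so a(n) must vanish for the sum to be zero at every ω; for odd n the bracket is zero and a(n) is left free. 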

One important difference between even and odd length 
impulse responses can be seen in direct convolution. The 
convolution summation given by 

$$\hat{x}(n) = \sum_{k=0}^{N-1} h(k)\,x(n-k)$$

involves only (N+1)/4 multiplies per output sample for N odd 
and N/2 multiplies for N even. The saving occurs because 
alternate values of h(n) are zero for N odd. Because of 
this savings and for technical considerations only Hilbert 
transformers of odd length are used. 
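
The multiply count for N odd can be illustrated with a folded form of the convolution sum. The sketch below is an assumption-laden example rather than the implementation used in this work: the taps are a truncated copy of the ideal response (3.14), not minimax weights, and the function names are hypothetical. 

```python
import math


def truncated_ideal_taps(n_taps):
    """Causal, truncated copy of the ideal response (3.14); illustrative only."""
    mid = (n_taps - 1) // 2
    taps = []
    for k in range(n_taps):
        n = k - mid
        taps.append(0.0 if n % 2 == 0 else 2.0 / (math.pi * n))
    return taps


def folded_output(h, x, n):
    """One output sample using antisymmetry and skipping zero taps.

    Returns (value, number_of_multiplies).
    """
    n_taps = len(h)
    mid = (n_taps - 1) // 2
    total, multiplies = 0.0, 0
    for k in range(mid):
        if h[k] == 0.0:
            continue
        # h(N-1-k) = -h(k), so the two products share one multiply.
        total += h[k] * (x[n - k] - x[n - (n_taps - 1 - k)])
        multiplies += 1
    return total, multiplies


def direct_output(h, x, n):
    """Plain convolution sum for comparison."""
    return sum(h[k] * x[n - k] for k in range(len(h)))


h = truncated_ideal_taps(79)
x = [math.sin(0.3 * i) + 0.5 * math.cos(1.1 * i) for i in range(200)]
fast, count = folded_output(h, x, 150)
slow = direct_output(h, x, 150)
print(count, abs(fast - slow))
```

For N = 79 the folded form uses 20 multiplies per output sample, which is (N+1)/4, and it reproduces the direct sum to rounding error. 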

In determining the values of h(n), Rabiner and Schafer 
[Ref. 24] used the Remez algorithm for the design of optimal 
FIR filters. The values of h(n) were calculated to minimize 
the peak approximation error which is given by 

$$G = \max_{2\pi F_L \le \omega \le 2\pi F_H} \left| D(\omega) - H^*(\omega) \right| \tag{3.24}$$

The Remez algorithm gives a Chebyshev or equiripple 
approximation to the desired response. Hence the error 
function is equiripple over the range $2\pi F_L \le \omega \le 2\pi F_H$. 
Given an N, $F_L$ and $F_H$ the resulting approximation is best in 
the minimax sense. 

Using this concept of an analytic signal representation 
for discrete-time signals, Cox and Robinson [Ref. 25] formed 
the analytic phase representation of a speech signal. Given 
a sampled speech signal, s(n), they calculated the Hilbert 
transform, $s^*(n)$, by the use of a 79-weight Hilbert 
transformer. Thus, having the analytic signal 

$$m(n) = s(n) + j\,s^*(n)$$

the original signal s(n) is given by 

$$s(n) = |m(n)| \cos\theta(n).$$

The analytic phase representation is given by cos θ(n). Thus 
by way of a mathematical artifice a real-valued sequence s(n) 
is represented as having magnitude and phase with the phase 
only being retained. Contrary to common sense, perhaps, this 
analytic phase representation of speech was found to be 
intelligible. While these experiments by themselves do not 
prove that phase is a physical invariant of speech, they do 
indicate that more research is needed to determine what 
role phase plays in speech intelligibility. 
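
A rough sketch of forming the analytic phase representation is given below. It is not the program used by Cox and Robinson or in this thesis: as a stated assumption it uses a Hamming-windowed truncation of the ideal response (3.14) in place of the minimax weights, and the names `windowed_hilbert_taps` and `analytic_phase` are hypothetical. The input sample is taken 39 samples back so that it lines up with the delay of the 79-weight filter. 

```python
import math

N_TAPS = 79
MID = (N_TAPS - 1) // 2


def windowed_hilbert_taps():
    """Truncated ideal response (3.14) with a Hamming window (an assumption;
    the source uses minimax weights instead)."""
    taps = []
    for k in range(N_TAPS):
        n = k - MID
        ideal = 0.0 if n % 2 == 0 else 2.0 / (math.pi * n)
        window = 0.54 - 0.46 * math.cos(2.0 * math.pi * k / (N_TAPS - 1))
        taps.append(ideal * window)
    return taps


def analytic_phase(s, taps):
    """Return cos(theta(n)) = s(n) / |m(n)| wherever the filter has full support."""
    out = []
    for n in range(N_TAPS - 1, len(s)):
        quad = sum(taps[k] * s[n - k] for k in range(N_TAPS))
        real = s[n - MID]                 # align with the 39-sample filter delay
        mag = math.hypot(real, quad)
        out.append(real / mag if mag > 0.0 else 0.0)
    return out


# Test tone of amplitude 3: the phase-only output should be a unit cosine.
omega = 2.0 * math.pi * 0.125
s = [3.0 * math.cos(omega * n) for n in range(400)]
phase_only = analytic_phase(s, windowed_hilbert_taps())
expected = [math.cos(omega * (n - MID)) for n in range(N_TAPS - 1, 400)]
worst = max(abs(a - b) for a, b in zip(phase_only, expected))
print(worst)
```

The amplitude of the input is discarded and only the unit-magnitude cosine of the phase remains, which is the sense in which the representation is phase only. 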

As was mentioned, a 79-weight Hilbert transformer was 
used in obtaining the analytic signal. Rabiner and Schafer 
[Ref. 26] calculated weights for three different values of 
peak approximation errors and cutoff frequencies. Table 3.1 
lists these weights and Figures 3.5 through 3.7 are plots of 
the magnitude of the frequency response. Table 3.1 lists 
only the even-numbered weights h(0), h(2), ..., h(38): since 
(N-1)/2 = 39 is odd, all odd-numbered weights are zero, and 
the weights have odd symmetry about n = 39. 

TABLE 3.1 

HILBERT TRANSFORMER WEIGHTS 

N = 79 

F_L = .01         F_L = .02         F_L = .05 
G = .0388830      G = .0024390      G = .0000010 

-.0229388         -.0019358         -.0000041 
-.0075151         -.0017746         -.0000179 
-.0087784         -.0025624         -.0000550 
-.0101565         -.0035600         -.0001389 
-.0117808         -.0048021         -.0003074 
-.0135612         -.0063300         -.0006182 
-.0155902         -.0081910         -.0011532 
-.0179182         -.0104453         -.0020239 
-.0206260         -.0131630         -.0033761 
-.0237742         -.0164470         -.0053956 
-.0274953         -.0204251         -.0083167 
-.0319865         -.0252943         -.0124372 
-.0375627         -.0313515         -.0181511 
-.0447012         -.0390711         -.0260178 
-.0542333         -.0492818         -.0369200 
-.0677331         -.0635544         -.0524475 
-.0885965         -.0852651         -.0759556 
-.1256401         -.1232135         -.1161821 
-.2111964         -.2097186         -.2053402 
-.6362830         -.6357869         -.6343000 
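
As a consistency check on Table 3.1, the full 79-weight impulse response can be rebuilt from one column and its peak error evaluated from (3.18) and (3.24). The sketch below uses the F_L = .02 column. It assumes that the entries are listed in the order h(0), h(2), ..., h(38), that F_H = .5 - F_L, and that a uniform grid of 4601 frequencies is fine enough; none of these details is stated explicitly in the source. 

```python
import math

# Even-numbered weights from the F_L = .02 column of Table 3.1, assumed to be
# in the order h(0), h(2), ..., h(38).
EVEN_WEIGHTS = [
    -.0019358, -.0017746, -.0025624, -.0035600, -.0048021,
    -.0063300, -.0081910, -.0104453, -.0131630, -.0164470,
    -.0204251, -.0252943, -.0313515, -.0390711, -.0492818,
    -.0635544, -.0852651, -.1232135, -.2097186, -.6357869,
]
N_TAPS = 79
MID = (N_TAPS - 1) // 2
F_L, F_H = 0.02, 0.48


def full_impulse_response():
    """Rebuild all 79 weights using h(n) = -h(N-1-n); odd weights stay zero."""
    h = [0.0] * N_TAPS
    for i, w in enumerate(EVEN_WEIGHTS):
        h[2 * i] = w
        h[N_TAPS - 1 - 2 * i] = -w
    return h


def amplitude(h, omega):
    """Real amplitude function of (3.18) with a(n) = 2 h((N-1)/2 - n)."""
    return sum(2.0 * h[MID - n] * math.sin(omega * n) for n in range(1, MID + 1))


h = full_impulse_response()
grid = [F_L + (F_H - F_L) * i / 4600 for i in range(4601)]
# Desired response is -1 on the band, so the error is | -1 - H*(omega) |.
peak_error = max(abs(-1.0 - amplitude(h, 2.0 * math.pi * f)) for f in grid)
mid_band_error = abs(-1.0 - amplitude(h, math.pi / 2.0))
print(peak_error, mid_band_error)
```

At ω = π/2 the error evaluates to about .00244, in agreement with the tabulated G = .0024390 for this column; the peak over the whole band should land close to the same value if the assumptions above hold. 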

Figure 3.5. Frequency Response of Hilbert Transformer 
N = 79, F_L = .01, G = .0388830 

Figure 3.6. Frequency Response of Hilbert Transformer 
N = 79, F_L = .02, G = .0024390 

Figure 3.7. Frequency Response of Hilbert Transformer 
N = 79, F_L = .05, G = .0000010 

IV. EXPERIMENTAL PROCEDURE 

This thesis extends the work of Cox and Robinson to the 
isolated word recognition field. Specifically, using the 
homomorphic and analytic signal processing techniques 
employed in experiments three and four, an isolated word 
recognition system is developed. 

A. DATA ACQUISITION 

In order to form a data base for use by the system, 
twenty volunteers were recruited to record the digits zero 
through nine. Each participant was given a questionnaire/ 
instruction sheet like that contained in Appendix A. All 
speakers were males between the ages of 25 and 35 and all 
were native English speakers. Their places of birth varied 
from eastern Pennsylvania to southern Tennessee. Ten of 
these speakers were selected to form the data base or 
pattern base of the system. The other ten speakers were 
used to test the system. 

The speech was recorded on an analog tape recorder with 
all recordings being done in the Speech Processing 
Laboratory. The recordings were done in the late afternoon 
or in the evening when the ambient noise level was at a 
minimum. The tape recorder used was the HP-3964A reel-to- 
reel instrumentation recorder running at 7.5 ips using AMPEX 
professional audio tape. 

Before this analog speech could be digitized, an 
appropriate bandwidth and sampling rate had to be 
determined. The power spectral density of each digit was 
computed and averaged over ten utterances of the digit. The 
majority of the power was found to be below 3 KHz except in 
the case of the number 'six', where non-negligible power was 
found at frequencies up to 6 KHz. A cutoff frequency of 4 
KHz was chosen, which is exactly half the bandwidth that Cox 
and Robinson used. As will be explained later, once the 
bandwidth is fixed the sampling rate is also fixed. In this 
case the sampling rate is fixed at 10.24 KHz (10,240 
samples/sec). 

The machine used to digitize the speech was the GENRAD 
2505 Signal Analysis System [Ref. 27]. The system is a 
narrowband (0 - 25 KHz) signal analysis system originally 
designed for vibrational analysis studies. The system uses 
a DEC PDP 11/34A as the host computer and supports two 
channels of A/D conversion. 

The heart of the system, software-wise, is GENRAD's Time 
Series Language (TSL) which allows the operator to control 
the A/D converter. TSL is an interpretive language which 
uses commands similar to BASIC. The TSL program 'ANADSK' is 
the routine that provides analog input to disk storage. 

Given a bandwidth, the 'ANADSK' routine sets the sampling 
rate at 2.56 times the highest frequency component to 
prevent aliasing. The system provides for high-speed 
continuous sampling and writes the digitized data to the 
system's Winchester disks in 2048 byte blocks. 

The two-channel A/D converter has two 6-pole Chebyshev 
filters in cascade each with 96 dB/octave rolloff above 
cutoff per channel as anti-aliasing filters. The A/D 
converter is a 2 μsec converter with a 12 bit output. 

Once the speech was digitized, a time window for the 
sampled data had to be determined. Referring again to the 
utterances whose power spectral densities were computed, the 
average length of the utterances was 740 msec. In order for 
the mathematics to work out nicely, a 750-msec window was 
chosen. 

Using TSL library routines 'RTIO' and 'XDISPL', a routine 
was written that displayed the digitized data on the 
system's CRT. The program graphically displayed 1024 
samples at a time and allowed the operator to select any 256 
samples for transfer to the W. R. Church Computer Center's 
IBM 3033 for processing. This transfer was via a 1200 baud 
modem. With the capability to view the data prior to 
transfer, the start of the utterance could be selected to 
within 128 samples. Since the time window was selected to 
be 750 msec and the speech was sampled at 10,240 samples/ 
sec, 7680 points needed to be transferred. Thus thirty 
blocks of 256 samples each were transferred per utterance. 
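
The figures quoted above follow from one another, which is what makes the 750-msec window convenient. A short worked check, with constant names chosen here for illustration: 

```python
BANDWIDTH_HZ = 4000          # cutoff frequency chosen in the text
RATE_FACTOR = 2.56           # sampling-rate multiplier applied by 'ANADSK'
WINDOW_SEC = 0.750           # time window per utterance
BLOCK_SAMPLES = 256          # samples per transferred block

sample_rate = round(BANDWIDTH_HZ * RATE_FACTOR)
samples_per_utterance = round(sample_rate * WINDOW_SEC)
blocks, leftover = divmod(samples_per_utterance, BLOCK_SAMPLES)
print(sample_rate, samples_per_utterance, blocks, leftover)
# prints: 10240 7680 30 0
```

The window is the shortest whole number of 256-sample blocks that covers the 740-msec average utterance length. 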

The transfer/interface program between the Speech Lab's 
PDP 11/34A and the IBM 3033 was written by LT Jay H. Benson. 
A copy of his program, 'CATCH', is included in Appendix B. 
The transfer of data via the modem was very time consuming 
because, for technical reasons, each sample, which occupied 
two bytes on the PDP 11/34A, was made into a four byte number 
for transfer. The sixteen most significant bits were then 
masked off prior to storage on the IBM system. In order to 
minimize the amount of disk storage required, the data was 
written to the disk using an unformatted FORTRAN write 
statement, using INTEGER*2 numbers. Even using this 
scheme to maximize storage efficiency, 24 cylinders plus 
magnetic tape backup were required to store the data. 

B. DATA PROCESSING 

The decision to use the IBM system to process the data 
was based on the availability of library routines (e.g., 
IMSL, NONIMSL), the DISSPLA graphics package, and the full 
screen text editor. All programs in Appendix B were written 
in FORTRAN H. 

The first task was to compute an average waveform for 
the speaker. In order to accomplish this, three of the 
four utterances of each of 10 speakers were averaged 
together. The program 'MEANS' was used to compute this 
average. The technique is very simple and straightforward 
as the ensemble mean was computed. This agrees with the 
work done by the Air Force [Ref. 28] where they assumed that 
the samples are statistically independent, identically 
distributed Gaussian random variables. This is an 
oversimplification as it is known that the vocal tract is 
slowly varying with the tract parameters changing only every 
10 msec. 
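
The ensemble mean amounts to a sample-by-sample average across the utterances. The sketch below is written in Python for illustration and is not the FORTRAN program 'MEANS'; the function name and the toy data are hypothetical. 

```python
def ensemble_mean(utterances):
    """Average equal-length utterances sample by sample."""
    if not utterances:
        raise ValueError("need at least one utterance")
    length = len(utterances[0])
    if any(len(u) != length for u in utterances):
        raise ValueError("utterances must have the same length")
    count = len(utterances)
    return [sum(column) / count for column in zip(*utterances)]


# Three short stand-in "utterances" (made-up values, not thesis data).
takes = [[1.0, 2.0, 3.0], [2.0, 2.0, 2.0], [3.0, 2.0, 1.0]]
average = ensemble_mean(takes)
print(average)
```

Because the average is taken at fixed sample positions, it is sensitive to how well the starts of the utterances are aligned. 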

The short-term cepstral representation of the averaged 
waveform was computed using the program 'CEP'. In keeping 
with Cox and Robinson the waveform was segmented into 
25 msec parts and each part was processed in sequence. 
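
At 10,240 samples/sec a 25 msec part is 256 samples, so a 7680-point waveform divides into 30 parts. The segmentation step alone can be sketched as follows; that the parts do not overlap is an assumption, and the cepstral processing of each part is not shown. 

```python
SAMPLE_RATE = 10240      # samples/sec, from the data acquisition section
SEGMENT_SEC = 0.025      # 25 msec segments


def segment(waveform, sample_rate=SAMPLE_RATE, segment_sec=SEGMENT_SEC):
    """Split a waveform into consecutive, non-overlapping segments."""
    size = round(sample_rate * segment_sec)
    return [waveform[i:i + size]
            for i in range(0, len(waveform) - size + 1, size)]


# Stand-in waveform: the sample indices 0..7679.
frames = segment(list(range(7680)))
print(len(frames), len(frames[0]))
```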

Finally, the analytic signal representation of the 
waveform was computed using an FIR Hilbert transformer with 
79 weights and a lower cutoff frequency of .05. The 
frequency response of this filter is shown in Figure 3.7. 
This particular filter was chosen over the other two 
79-weight filters because of its very small approximation 
error. The small approximation error does imply that the 
transition band of this filter is larger than that of the 
other two filters; however, this was deemed less important 
than the peak approximation error. 

Examples of these three representations of the same 
utterances can be found in Figures 4.1 thru 4.30. These 
examples are of a male 30 years old, born and raised in 
eastern Pennsylvania, and a Naval cryptologic officer. In 
order to display all 7680 points on one graph the waveform 
was first normalized, then divided into four 1920 point 
parts. Each part was biased by (N - 1) * 2, where N = 1, 2, 
3, 4, to permit graphing of the four segments on one page. 
The graphs should be read from left to right, top to bottom. 
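
The plotting arrangement can be sketched as below. Two details are assumptions, since the text does not give them: normalization is taken to mean division by the peak magnitude, and the bias is added rather than subtracted. The function name is hypothetical. 

```python
def stack_for_plot(waveform, parts=4):
    """Normalize to unit peak, split into equal parts, offset part N by (N-1)*2."""
    peak = max(abs(v) for v in waveform) or 1.0
    size = len(waveform) // parts
    stacked = []
    for index in range(parts):            # index = N - 1
        chunk = waveform[index * size:(index + 1) * size]
        stacked.append([v / peak + index * 2 for v in chunk])
    return stacked


# Made-up 7680-point waveform with values between -3 and 3.
rows = stack_for_plot([float((i % 7) - 3) for i in range(7680)])
print([len(r) for r in rows])
```

With unit-peak data each part occupies a band two units high, so a bias of (N - 1) * 2 keeps the four traces from overlapping. 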

C. DECISION ALGORITHM 

Once the speech had been processed, a decision algorithm 
had to be formulated to classify utterances based on the 
patterns collected. All of the isolated word recognizers 
use a form of classical pattern recognition to classify 
utterances. The VRM system uses a nearest neighbor 
algorithm with a variable threshold. If no utterance is 
within the distance specified by the threshold, an unable-to- 
classify message is issued. 

The nearest neighbor rule is a special case of the pooled 
form of the k-nearest neighbor rule [Ref. 29]. For the two 
class case, a hypersphere is formed around the vector y to 
include k total samples regardless of their class. Thus 
$k_1 + k_2 = k$, where $k_i$ equals the number of vectors belonging 
to class i. The quotient $k_1/k_2$ is formed and compared to 
one. If $k_1/k_2 > 1$, there are more class one vectors in the 
hypersphere around y and the vector y is said to belong to 
class one. If the converse of the inequality is true, 
$k_1/k_2 < 1$, then y is said to belong to class two. For the 
case k = 1 and a large number of stored samples, the 
probability of error is less than twice the minimum 
probability of error for any decision rule. 

The nearest neighbor rule was employed to classify the 
utterances. Using the program 'DEC', the Euclidean distance 
between a test vector and the stored patterns was computed. 
The results of this pattern matching are discussed in the 
next chapter. 
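
A minimal sketch of nearest neighbor classification by Euclidean distance follows. It is written in Python for illustration and is not the FORTRAN program 'DEC'; the function names and toy patterns are hypothetical, and the optional `threshold` argument mirrors the rejection behavior described for the VRM system rather than anything stated about this system. 

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest_neighbor(test_vector, patterns, threshold=None):
    """Return (label, distance) of the closest stored pattern.

    `patterns` is a list of (label, vector) pairs. If `threshold` is given
    and no pattern lies within it, the label is None (unable to classify).
    """
    label, vector = min(patterns, key=lambda p: euclidean(test_vector, p[1]))
    distance = euclidean(test_vector, vector)
    if threshold is not None and distance > threshold:
        return None, distance
    return label, distance


# Toy three-sample patterns standing in for 7680-sample reference waveforms.
patterns = [("zero", [0.0, 1.0, 0.0]), ("one", [1.0, 0.0, -1.0])]
print(nearest_neighbor([0.9, 0.1, -0.8], patterns))
print(nearest_neighbor([5.0, 5.0, 5.0], patterns, threshold=1.0))
```

The first call selects 'one'; the second finds no pattern within the threshold and returns no label. 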

Figure 4.1. Sampled Waveform, Zero 
Figure 4.2. Analytic Representation of Zero 
Figure 4.3. Cepstral Representation of Zero 
Figure 4.4. Sampled Waveform, One 
Figure 4.5. Analytic Representation of One 
Figure 4.6. Cepstral Representation of One 
Figure 4.7. Sampled Waveform, Two 
Figure 4.8. Analytic Representation of Two 
Figure 4.9. Cepstral Representation of Two 
Figure 4.10. Sampled Waveform, Three 
Figure 4.11. Analytic Representation of Three 
Figure 4.12. Cepstral Representation of Three 
Figure 4.13. Sampled Waveform, Four 
Figure 4.14. Analytic Representation of Four 
Figure 4.15. Cepstral Representation of Four 
Figure 4.16. Sampled Waveform, Five 
Figure 4.17. Analytic Representation of Five 
Figure 4.18. Cepstral Representation of Five 
Figure 4.19. Sampled Waveform, Six 
Figure 4.20. Analytic Representation of Six 
Figure 4.21. Cepstral Representation of Six 
Figure 4.22. Sampled Waveform, Seven 
Figure 4.23. Analytic Representation of Seven 
Figure 4.24. Cepstral Representation of Seven 
Figure 4.25. Sampled Waveform, Eight 
Figure 4.26. Analytic Representation of Eight 
Figure 4.27. Cepstral Representation of Eight 
Figure 4.28. Sampled Waveform, Nine 
Figure 4.29. Analytic Representation of Nine 
Figure 4.30. Cepstral Representation of Nine 

[Figures 4.1 through 4.30: horizontal axis SAMPLE NUMBERS (K), 
320 to 1920; vertical axis NORMALIZED VOLTAGE.] 

V. RESULTS AND CONCLUSIONS 

Ten speakers were selected to form the data base for the 
system. Their utterances were processed to obtain both 
their cepstral and analytic phase representations. The 
system was then tested using two groups of speakers. The 
first group, denoted Group A, consisted of speakers whose 
utterances were used to form the data base. Each speaker 
repeated the digits four times, and only three of these 
utterances were used to compute the average waveform and 
hence the cepstral and analytic phase representations. 
Group A can be thought of as having trained the system. The 
second group, Group B, consisted of the other ten speakers. 

The system was tested using ten utterances per digit 
from each of the two groups of speakers. The reference 
pattern space was varied, using three different spaces each 
containing 100 patterns. The cepstral and analytic 
representations formed two of the reference spaces, while 
the unprocessed signals formed the third space. Tables 5.1 
and 5.2 contain the results of the test. 

The results for Group A, in all categories, are below 
the results attainable with the VRM system. For three 
training passes the VRM system has a 97% recognition rate. 
The high percentage of recognition for the unprocessed 
waveforms was to be expected since the speakers trained the 
system and the pattern space did consist of the average of 
each speaker's utterances. The distances between the 
pattern vectors and the test vector were of the same 
magnitude for the unprocessed waveforms, regardless of 
whether the utterance was correctly identified or not. In 
the case of the short-term phase representations, when the 
system correctly identified an utterance, the distance 
between the test vector and its nearest neighbor was an 
order of magnitude less than all the other distances. When 
the system incorrectly identified an utterance, all distances 
were of the same magnitude. 

The success demonstrated in the speaker-dependent case 
is not without cost. As compared to the VRM system, which 
has at most 120 bits/pattern, this system has 122.88K 
bits/pattern (7680 two-byte numbers). There was an 
extensive amount of manual editing involved to obtain these 
patterns, on the order of ten minutes per utterance. 
However, it was shown that short-term phase-only speech can 
be used to construct a speaker-dependent isolated word 
recognizer. 
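
The storage comparison can be worked out directly; the constant names below are chosen here for illustration. 

```python
SAMPLES_PER_PATTERN = 7680   # 750 msec at 10,240 samples/sec
BITS_PER_SAMPLE = 16         # two-byte numbers
VRM_BITS_PER_PATTERN = 120   # upper figure quoted for the VRM system

pattern_bits = SAMPLES_PER_PATTERN * BITS_PER_SAMPLE
ratio = pattern_bits / VRM_BITS_PER_PATTERN
print(pattern_bits, ratio)
# prints: 122880 1024.0
```

Each stored pattern is therefore roughly a thousand times larger than a VRM pattern. 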

The results for Group B appear to be abysmal; however, 
several things must be considered. First, there was no 
preprocessing of the signals to time-warp them. Second, no 
features were extracted; only the entire waveforms were 
used. Third, the decision algorithm may have to be tailored 
to fit the data, rather than using a general purpose 
decision rule. Last, but certainly not least, no system 
exists today that is completely speaker independent. 

One final observation concerning Group B: when the 
decision algorithm incorrectly identified any utterance it 
did so with a great deal of bias. In 30% of the cases where 
an utterance was incorrectly identified the number 'one' was 
picked to be the nearest neighbor. 

This thesis was not an attempt to definitively answer 
the question, "Is phase a physical invariant of speech?" 
Its purpose was to show that phase should be considered when 
constructing a word recognition system. This was 
accomplished. The next step is to use the information obtained 
from the phase in conjunction with other word recognition 
systems to possibly improve these systems with the long 
range goal of solving the speaker-independent word 
recognition problem. 

TABLE 5.1 

GROUP A RECOGNITION RESULTS 
BASED ON TEN UTTERANCES PER DIGIT 

Digits    Unprocessed    Cepstral          Analytic 
          Waveforms      Representation    Representation 

0          9              7                 6 
1         10              8                 7 
2         10              6                 4 
3          9              5                 3 
4         10              5                 5 
5         10              7                 5 
6          8              4                 3 
7          9              4                 3 
8         10              6                 7 
9         10              7                 6 

AVG        9.5            5.9               4.9 

TABLE 5.2 

GROUP B RECOGNITION RESULTS 
BASED ON TEN UTTERANCES PER DIGIT 

Digits    Unprocessed    Cepstral          Analytic 
          Waveforms      Representation    Representation 

0          2              0                 1 
1          4              3                 3 
2          1              1                 0 
3          2              0                 0 
4          1              0                 1 
5          1              0                 0 
6          0              0                 0 
7          1              0                 0 
8          0              1                 0 
9          2              1                 0 

AVG        1.4            .6                .5 
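
The per-digit counts in Tables 5.1 and 5.2 can be tallied into overall recognition rates. The short sketch below transcribes the counts from the two tables; the dictionary keys and function name are chosen here for illustration. 

```python
# Correct recognitions out of ten utterances per digit (digits 0 through 9),
# transcribed from Tables 5.1 and 5.2.
GROUP_A = {
    "unprocessed": [9, 10, 10, 9, 10, 10, 8, 9, 10, 10],
    "cepstral":    [7, 8, 6, 5, 5, 7, 4, 4, 6, 7],
    "analytic":    [6, 7, 4, 3, 5, 5, 3, 3, 7, 6],
}
GROUP_B = {
    "unprocessed": [2, 4, 1, 2, 1, 1, 0, 1, 0, 2],
    "cepstral":    [0, 3, 1, 0, 0, 0, 0, 0, 1, 1],
    "analytic":    [1, 3, 0, 0, 1, 0, 0, 0, 0, 0],
}
TRIALS_PER_DIGIT = 10


def recognition_rate(counts):
    """Percentage of utterances recognized correctly."""
    return 100.0 * sum(counts) / (TRIALS_PER_DIGIT * len(counts))


for name, table in (("A", GROUP_A), ("B", GROUP_B)):
    for space, counts in table.items():
        print(f"Group {name} {space}: {recognition_rate(counts):.0f}%")
```

The rates are 95%, 59%, and 49% for Group A and 14%, 6%, and 5% for Group B, matching the AVG rows of the tables. 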

APPENDIX A 

INSTRUCTION SHEET 

Thank you for participating in the Speech Processing 
Lab's effort to collect speech samples. This exercise will 
require about 10 minutes of your time to complete. 

I. Biographical Data 

A. Name: 

B. Age: 

C. Sex: 

D. Place of Birth: 

E. Occupation: 

II. Speech Sampling 

A. Repeat each word on the list four times, pausing 
approximately 5 sec. between utterances. (For example: the 
first word on the list is 'zero', therefore you would say: 
'zero' (pause) 'zero' (pause) 'zero' (pause) 'zero' (pause) 
'one' (pause) ) 

zero      six 
one       seven 
two       eight 
three     nine 
four 
five 

B. Repeat the following exercise 3 times: 

Read the entire list of numbers at your natural speaking 
rate pausing approx. 5 secs. before repeating the list. Do 
not pause unnaturally between the numbers. We are looking 
for continuous speech such as in a conversation. 

zero-one-two-three-four-five-six-seven-eight-nine (pause/repeat) 

APPENDIX B 

COMPUTER PROGRAMS 

All programs were written in IBM FORTRAN H to run on the 
W. R. Church Computer Center's IBM 3033. The programs 
access routines from the IMSL library. The graphics 
programs interact with the DISSPLA graphics package. 